Feature Selection and Customer Segmentation¶

Introduction¶

In this project, the goal is to analyse the data and create customer segmentation model in the end. There is a binary variable 'converted' in the data - if customer converted or not, Which has to play the crucial role for the clustering. The approach is to find clusters that will be different by conversion rates.

Objective¶

The primary objective of this work is to identify the key features that significantly contribute to predicting the converted variable. Subsequently, these features will be leveraged to segment customers effectively.

Approach¶

  1. Feature Selection:

    • Utilize several techniques such as Recursive Feature Elimination, feature importance from machine learning models, and domain knowledge to identify crucial features.
  2. Customer Segmentation:

    • Employ clustering algorithms such as K-means to group customers based on the selected features.
  3. Model Evaluation:

    • Assess the performance of the Customer Segmentation model.
In [7]:
import sys
In [8]:
sys.executable
Out[8]:
'/home/roma/LENUS_TASK/Customer-Seg-Study/env/bin/python'
In [5]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from umap import UMAP
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV
from tqdm import tqdm
import plotly.graph_objects as go
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_score, adjusted_rand_score, homogeneity_score, completeness_score, v_measure_score
In [9]:
# Get the current directory
current_directory = Path.cwd()
In [11]:
!ls data
customer_data_sample.csv  test
In [12]:
# The relative path to CSV file
csv_file_path = current_directory / "data" / "customer_data_sample.csv"
In [13]:
df = pd.read_csv(csv_file_path)
In [14]:
df.shape
Out[14]:
(891, 10)
In [15]:
df.head()
Out[15]:
customer_id converted customer_segment gender age related_customers family_size initial_fee_level credit_account_id branch
0 15001 0 13 male 22.0 1 0 14.5000 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Helsinki
1 15002 1 11 female 38.0 1 0 142.5666 afa2dc179e46e8456ffff9016f91396e9c6adf1fe20d17... Tampere
2 15003 1 13 female 26.0 0 0 15.8500 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Helsinki
3 15004 1 11 female 35.0 1 0 106.2000 abefcf257b5d2ff2816a68ec7c84ec8c11e0e0dc4f3425... Helsinki
4 15005 0 13 male 35.0 0 0 16.1000 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Helsinki
In [16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        891 non-null    int64  
 1   converted          891 non-null    int64  
 2   customer_segment   891 non-null    int64  
 3   gender             891 non-null    object 
 4   age                714 non-null    float64
 5   related_customers  891 non-null    int64  
 6   family_size        891 non-null    int64  
 7   initial_fee_level  891 non-null    float64
 8   credit_account_id  891 non-null    object 
 9   branch             889 non-null    object 
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB

1. Feature Meanings and Descriptions¶

Field Explanation
customer_id Numeric id for a customer
converted Whether a customer converted to the product (1) or not (0)
customer_segment Numeric id of a customer segment the customer belongs to
gender Customer gender
age Customer age
related_customers Numeric - number of people who are related to the customer
family_size Numeric - size of family members
initial_fee_level Initial services fee level the customer is enrolled to
credit_account_id Identifier (hash) for the customer credit account. If customer has none, they are shown as "9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6c9bc9d493a23be9de0"
branch Which branch the customer mainly is associated with

Initial guessing before any analysis


In order to understand our data and make some initial guesses, we can observe each variable and try to understand their meaning. Since the goal of this task is to identify the most important features for predicting customer conversion, I would like to make some initial guesses before any analysis, based solely on common logic.


  • Variable - age.
  • Type - Demographic
  • Explanation - Since the final end user product of Linus is not a traditional product, I think some generations would appreciate it more than others.

  • Variable - initial_fee_level.
  • Type - Segmential
  • Explanation - This variable seems very similar to customer segments itself and has to be important.

  • Variable - customer_segment.
  • Type - Derived
  • Explanation - customer_segment seems like a feature created based on data.
In [17]:
# Splitting the data into train and test sets
df, test = train_test_split(df, test_size=0.2, random_state=1)
In [18]:
test_dir_path = current_directory / "data" / "test"
if not os.path.exists(test_dir_path):
    os.makedirs(test_dir_path)
test_file_path = current_directory / "data" / "test" / "test.csv"
test.to_csv(test_file_path)

1.1 Exploring features¶

In [19]:
numerical_feats = ['age', 'related_customers', 'family_size', 'initial_fee_level']
categorical_feats = ['customer_segment', 'gender', 'branch']
target = 'converted'
In [20]:
counts = df['converted'].value_counts()

# Create a bar plot with specified colors
fig = px.bar(x=counts.index, y=counts.values, color=counts.index,
             labels={'x': 'Target', 'y': 'Count'}, title='Distribution of Target Variable')
fig.update_layout(xaxis_type='category', title_x=0.5)  # Centering the title
fig.show()

Age

In [21]:
# Create two subplots for each converted category
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

# Plot for converted == 0
sns.histplot(df[df['converted'] == 0]['age'], kde=True, ax=axes[0], color='skyblue')
axes[0].set_title('Converted = 0')

# Plot for converted == 1
sns.histplot(df[df['converted'] == 1]['age'], kde=True, ax=axes[1], color='salmon')
axes[1].set_title('Converted = 1')

plt.tight_layout()
plt.show()
No description has been provided for this image

There are clients 0 and 1 years old ???

In [22]:
df['age'].value_counts()
Out[22]:
age
24.00    23
30.00    22
19.00    22
22.00    20
18.00    19
         ..
0.42      1
66.00     1
0.67      1
20.50     1
0.75      1
Name: count, Length: 86, dtype: int64

Even 0.83 ???

Initial fee level

In [24]:
var = 'initial_fee_level'
data = pd.concat([df.converted, df[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=target, y=var, data=data, hue='converted')
fig.axis(ymin=0, ymax=300)
Out[24]:
(-0.5, 1.5, 0.0, 300.0)
No description has been provided for this image
In [25]:
df[df['converted']==1]['family_size'].value_counts()
Out[25]:
family_size
0    178
1     57
2     31
3      2
5      1
Name: count, dtype: int64
In [26]:
df['branch'].value_counts()
Out[26]:
branch
Helsinki    513
Tampere     133
Turku        64
Name: count, dtype: int64

until now we just used intuition, it means not much.subjective...

In [28]:
corr_feats = ['converted', 'customer_segment', 'age', 'related_customers', 'family_size', 'initial_fee_level']
In [29]:
#correlation matrix
corrmat = df[corr_feats].corr()
f, ax = plt.subplots(figsize=(6, 7))
sns.set(font_scale=1.25)
sns.heatmap(corrmat, vmax=.8, square=True, cbar=True, annot=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=corr_feats, xticklabels=corr_feats);
No description has been provided for this image

Most correlated features with the target are customer_segment and initial_fee_level But they are correlated to each other too

In [31]:
counts = df.groupby(['customer_segment', 'converted']).size().reset_index(name='count')

sns.kdeplot(x=df.customer_segment, y=df.converted, cmap="Blues", fill=True, bw_adjust=0.5)

plt.yticks([0, 1])
plt.xticks([11, 12, 13])

# Add annotations
for index, row in counts.iterrows():
    plt.text(row['customer_segment'], row['converted'], row['count'], color='black', ha='center')
    
plt.show()
No description has been provided for this image

2. Missing Data¶

In [32]:
df.isnull().sum()
Out[32]:
customer_id            0
converted              0
customer_segment       0
gender                 0
age                  144
related_customers      0
family_size            0
initial_fee_level      0
credit_account_id      0
branch                 2
dtype: int64
In [33]:
df.shape
Out[33]:
(712, 10)

Two features with missing values Age and branch

Since I find this var bit strange and doesnt seems good predictor, I will not spend too much recources on that, I will just impute avg values.¶
For branch, I will impute most frequent value 'Helsinki'¶
In [35]:
df.groupby(['branch', 'converted']).size().reset_index(name='count')
Out[35]:
branch converted count
0 Helsinki 0 344
1 Helsinki 1 169
2 Tampere 0 60
3 Tampere 1 73
4 Turku 0 39
5 Turku 1 25
In [36]:
# Definition of imputation strategies for each column
imputation_strategies = {
    'age': 'mean',   
    'branch': 'most_frequent' 
}
In [37]:
age_imputer = SimpleImputer(missing_values=np.nan, strategy=imputation_strategies['age'])
branch_imputer = SimpleImputer(missing_values=np.nan, strategy=imputation_strategies['branch'])
In [38]:
preprocessor = ColumnTransformer(
    transformers=[
        ('age_imp', age_imputer, ['age']),
        ('branch_imp', branch_imputer, ['branch'])
    ])
In [39]:
df.loc[:, ['age', 'branch']] = preprocessor.fit_transform(df)
In [40]:
df.isnull().sum()
Out[40]:
customer_id          0
converted            0
customer_segment     0
gender               0
age                  0
related_customers    0
family_size          0
initial_fee_level    0
credit_account_id    0
branch               0
dtype: int64

3. Outliers treatment, Univariate approach¶

In [41]:
df.head()
Out[41]:
customer_id converted customer_segment gender age related_customers family_size initial_fee_level credit_account_id branch
301 15302 1 13 male 30.166232 2 0 46.5000 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Turku
309 15310 1 11 female 30.000000 0 0 113.8584 e70ba215a23e2c438f86bc8ddf119c579b7bff180841c6... Tampere
516 15517 1 12 female 34.000000 0 0 21.0000 16ee13fe0dd987f3ef966e930adebd1e4f5d40f6180ac7... Helsinki
120 15121 0 12 male 21.000000 2 0 147.0000 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Helsinki
570 15571 1 12 male 62.000000 0 0 21.0000 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Helsinki
In [42]:
# Defining Interquartile Range
q1 = df.age.quantile(0.25)
q3 = df.age.quantile(0.75)
iqr = q3 - q1
up = q3 + 100 * iqr
low = q1 - 1.5 * iqr
In [43]:
df[(df.age < low) | (df.age > up)].shape
Out[43]:
(6, 10)

Since I don't see big potential in Age variable, I dont want to remove lots of data point and at the same time I want to keep the Age

In [45]:
sns.set_style("whitegrid")

# Create the plot
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='initial_fee_level', bins=40, kde=True, color='lightblue', edgecolor='black')

# Add labels and title
plt.xlabel('initial_fee_level', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('initial_fee_level Distribution', fontsize=16)

# Add grid
plt.grid(True, linestyle='--', alpha=0.7)

# Show plot
plt.show()
No description has been provided for this image

No normal distribution !!! therefore, can be capped from the top

In [47]:
sns.boxplot(df["initial_fee_level"])
Out[47]:
<Axes: ylabel='initial_fee_level'>
No description has been provided for this image
In [48]:
# Defining Interquartile Range for initial_fee_level
low = 0
up = 400
In [49]:
df[(df.initial_fee_level < low) | (df.initial_fee_level > up)].shape
Out[49]:
(17, 10)

I Want to remove around ~ 2% of data that is `~ 17 rows.


In [51]:
df = df[df.initial_fee_level <= up]
In [52]:
df.shape
Out[52]:
(695, 10)

4. Create new features¶

In [53]:
no_credit_account_value = '9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6c9bc9d493a23be9de0'
In [54]:
df.loc[:, 'has_credit_account'] = df['credit_account_id'].apply(lambda x: 0 if x==no_credit_account_value else 1)
In [55]:
fig = px.histogram(df, x=df['has_credit_account'], color=df['has_credit_account'])
fig.show()
/home/roma/LENUS_TASK/Customer-Seg-Study/env/lib/python3.9/site-packages/plotly/express/_core.py:2065: FutureWarning:

When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.

5. Feature encoding¶

In [56]:
df.head()
Out[56]:
customer_id converted customer_segment gender age related_customers family_size initial_fee_level credit_account_id branch has_credit_account
301 15302 1 13 male 30.166232 2 0 46.5000 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Turku 0
309 15310 1 11 female 30.000000 0 0 113.8584 e70ba215a23e2c438f86bc8ddf119c579b7bff180841c6... Tampere 1
516 15517 1 12 female 34.000000 0 0 21.0000 16ee13fe0dd987f3ef966e930adebd1e4f5d40f6180ac7... Helsinki 1
120 15121 0 12 male 21.000000 2 0 147.0000 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Helsinki 0
570 15571 1 12 male 62.000000 0 0 21.0000 9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6... Helsinki 0
In [57]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()
In [58]:
# Fit LabelEncoder
label_encoder.fit(df['gender'])
Out[58]:
LabelEncoder()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LabelEncoder()
In [59]:
# Transform
df.loc[:, 'gender'] = label_encoder.transform(df.gender)
In [60]:
# Labelencode 'branch' column by population of those cities , or by value counts
In [61]:
# Define custom order
custom_order = {'Turku': 0, 'Tampere': 1, 'Helsinki': 2}
In [62]:
df['branch'].value_counts()
Out[62]:
branch
Helsinki    507
Tampere     124
Turku        64
Name: count, dtype: int64
In [63]:
df.loc[:, 'branch'] = df['branch'].map(custom_order)

6. Feature selection¶

By carefully selecting important features, we enhance the accuracy of our Customer Segmentation model, ensuring meaningful and actionable insights for targeted business strategies.

In [66]:
modeling_feats = ['customer_segment', 'gender', 'age', 'related_customers', 'family_size', 
                  'initial_fee_level', 'branch', 'has_credit_account']

6.1 Scaling¶

In [67]:
scaler = StandardScaler()
scaler.fit(df[modeling_feats])
print('variables mean values: \n' + 90*'-' + '\n' , scaler.mean_)
scaled_matrix = scaler.transform(df[modeling_feats])
variables mean values: 
------------------------------------------------------------------------------------------
 [12.3323741   0.65611511 30.14104317  0.48776978  0.35251799 53.09461439
  1.63741007  0.21582734]
In [68]:
scaled_feats = ['scaled_'+feat for feat in modeling_feats]
In [69]:
df.loc[:, scaled_feats] = scaled_matrix
In [70]:
# Define transformation function for test data
In [71]:
test = pd.read_csv(test_file_path)
In [72]:
def preprocess_data(data, 
                    imputation_strategies, 
                    preprocessor, 
                    label_encoder, 
                    scaler):
    # missing values
    data.loc[:, ['age', 'branch']] = preprocessor.fit_transform(data)
    # new feature
    data.loc[:, 'has_credit_account'] = data['credit_account_id'].apply(lambda x: 0 if x==no_credit_account_value else 1)
    # label encoding
    data.loc[:, 'gender'] = label_encoder.transform(data.gender)
    data.loc[:, 'branch'] = data['branch'].map(custom_order)
    # scale
    scaled_matrix = scaler.transform(data[modeling_feats])
    data.loc[:, scaled_feats] = scaled_matrix
    
    return data
    
In [73]:
test = preprocess_data(test, imputation_strategies, preprocessor, label_encoder, scaler)

PCA Analysis - its always important to check how many pcs explain 80% of variance

In [75]:
pca = PCA()
pca.fit(scaled_matrix)
pca_samples = pca.transform(scaled_matrix)
In [76]:
scaled_matrix.shape
Out[76]:
(695, 8)
In [77]:
pca = PCA()
pca.fit(scaled_matrix)

# Get the explained variance ratio for each principal component
explained_variance_ratio = pca.explained_variance_ratio_

# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)

# Generate the plot
plt.figure(figsize=(10, 6))

# Plot explained variance ratio with customized colors
bars = plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.7, align='center', color='skyblue', label='Explained Variance Ratio')

# Plot cumulative explained variance with a different color
plt.plot(range(1, len(explained_variance_ratio) + 1), cumulative_explained_variance, color='orange', marker='o', linestyle='-', linewidth=2, label='Cumulative Explained Variance')

# Add values on top of each bar
for i, bar in enumerate(bars):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), f'{explained_variance_ratio[i]:.2f}', ha='center', va='bottom')

plt.xlabel('Principal Component')
plt.ylabel('Explained Variance')
plt.title('Explained Variance and Cumulative Explained Variance by Principal Component')
plt.xticks(range(1, len(explained_variance_ratio) + 1))
plt.legend()
plt.grid(True)

# Customize background color
plt.gca().set_facecolor('lightgrey')

plt.show()
No description has been provided for this image

UMAP dimentionality reduction plot before feature selection , all features included

In [79]:
features = scaled_matrix

umap_2d = UMAP(n_components=2, init='random', random_state=0)
umap_3d = UMAP(n_components=3, init='random', random_state=0)

proj_2d = umap_2d.fit_transform(features)
proj_3d = umap_3d.fit_transform(features)

fig_2d = px.scatter(
    proj_2d, x=0, y=1,
    color=df.converted, labels={'color': 'converted'}
)
fig_3d = px.scatter_3d(
    proj_3d, x=0, y=1, z=2,
    color=df.converted, labels={'color': 'converted'}
)
fig_3d.update_traces(marker_size=5)

fig_2d.show()
fig_3d.show()
/home/roma/LENUS_TASK/Customer-Seg-Study/env/lib/python3.9/site-packages/umap/umap_.py:1943: UserWarning:

n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.

/home/roma/LENUS_TASK/Customer-Seg-Study/env/lib/python3.9/site-packages/umap/umap_.py:1943: UserWarning:

n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.

6.2 RFE: Recursive Feature Elimination¶

Overview:

RFE (Recursive Feature Elimination) is a technique used for selecting the most important features from a dataset. It operates by iteratively training a model, ranking features by their importance, and eliminating the least important ones until the desired number is reached. This iterative process enhances model efficiency and interpretability.

Key Steps:

  1. Model Training: Train a model on the entire feature set.
  2. Feature Ranking: Rank features based on their importance scores.
  3. Feature Elimination: Recursively eliminate the least important features.
  4. Stopping Criterion: Stop when the desired number of features is reached or a predetermined criterion is met.

RFE helps streamline the feature selection process, leading to improved model performance and easier interpretation of results.

In [80]:
# Initialize the estimator
estimator = RandomForestClassifier(n_estimators=50, random_state=42)
In [81]:
# Perform Recursive Feature Elimination (RFE)
# In this example, we specify to keep 5 features
rfe = RFE(estimator, step=1)
In [82]:
rfe.fit(scaled_matrix, df.converted)
Out[82]:
RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=42))
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=42))
RandomForestClassifier(n_estimators=50, random_state=42)
RandomForestClassifier(n_estimators=50, random_state=42)
In [83]:
# Selected features after RFE
selected_features = rfe.support_
print("Selected Features:\n")
for feature, selected in zip(modeling_feats, selected_features):
    print(f"{feature}: {'Selected' if selected else 'Not Selected'}")
Selected Features:

customer_segment: Selected
gender: Selected
age: Selected
related_customers: Not Selected
family_size: Not Selected
initial_fee_level: Selected
branch: Not Selected
has_credit_account: Not Selected
In [84]:
# Feature ranking after RFE (ranking of features by importance)
feature_ranking = rfe.ranking_
print("\nFeature Rankings:\n")
for feature, rank in zip(modeling_feats, feature_ranking):
    print(f"{feature}: {rank}")
Feature Rankings:

customer_segment: 1
gender: 1
age: 1
related_customers: 3
family_size: 4
initial_fee_level: 1
branch: 5
has_credit_account: 2
In [85]:
# Obtain feature importances from the estimator (Random Forest)
feature_importances = rfe.estimator_.feature_importances_
print("\nFeature Importances:\n")
for feature, importance in zip(modeling_feats, feature_importances):
    print(f"{feature}: {importance}")
Feature Importances:

customer_segment: 0.10554244028154942
gender: 0.26695115741618053
age: 0.30036380140865826
related_customers: 0.3271426008936118

6.3 Exploring Multiple Models for Feature Importance¶

Overview:

In this approach, I explored various models to determine feature importance. I selected three models: Support Vector Machine (SVM), Random Forest, and Linear Regression. Utilizing grid search, identified the best parameters for each model and calculated the average feature importances across the best models. Ultimately, obtained a consolidated view of feature importances derived from three distinct types of models.

Key Steps:

  1. Model Selection:

    • Chose three different models: SVM, Random Forest, and Linear Regression.
  2. Grid Search:

    • Performed grid search to find the best hyperparameters for each model.
  3. Feature Importance Calculation:

    • Calculated the average feature importances from the best models of each type.

This approach provides a comprehensive understanding of feature importance, leveraging insights from multiple modeling techniques.

In [87]:
X, y = scaled_matrix, df.converted
In [88]:
# Define models
models = {
    #'SVM': SVC(),
    'Random Forest': RandomForestClassifier(),
    'Linear Classifier': LogisticRegression(max_iter=1000)
}

# Define parameter grids for tuning
param_grids = {
    #'SVM': {'C': [0.1, 1, 10], 'gamma': [0.1, 0.01], 'kernel': ['linear', 'rbf']},
    'Random Forest': {'n_estimators': [50, 200], 'max_depth': [3, 5, 10]},
    'Linear Classifier': {'C': [0.1, 1, 10]}
}

# Dictionary to store best parameters
best_params = {}

# Dictionary to store feature importances
feature_importances = {}
In [89]:
# Loop over models
for name, model in models.items():
    print(f"Training {name}...")
    # Initialize progress bar
    with tqdm(total=len(param_grids[name]), desc=f"{name} - Parameter Grid Search") as pbar:
        # Loop over parameter grid
        for param_set in param_grids[name]:
            # Perform cross-validation and parameter tuning
            clf = GridSearchCV(model, param_grids[name], cv=5)
            clf.fit(X, y)
            best_params[name] = clf.best_params_

            # Train model with best parameters
            best_model = clf.best_estimator_
            best_model.fit(X, y)

            # Extract feature importances
            if name == 'Linear Classifier':
                importance = np.abs(best_model.coef_[0])
            else:
                importance = best_model.feature_importances_

            feature_importances[name] = importance

            # Update progress bar
            pbar.update(1)
            pbar.set_postfix({'Best Params': clf.best_params_})
Training Random Forest...
Random Forest - Parameter Grid Search: 100%|█| 2/2 [00:12<00:00,  6.04s/it, Best
Training Linear Classifier...
Linear Classifier - Parameter Grid Search: 100%|█| 1/1 [00:00<00:00, 12.87it/s, 
In [90]:
# Averaging feature importances
avg_importance = np.mean([value for value in feature_importances.values()], axis=0)

# Create dictionary with averaged feature importances
avg_feature_importances = {f'Feature_{i}': avg_importance[i] for i in range(len(avg_importance))}
In [91]:
fig = go.Figure(go.Bar(
    x=modeling_feats,
    y=avg_importance,
    marker=dict(color='rgb(26, 118, 255)')
))

# Update layout for better visualization
fig.update_layout(
    title='Feature Importances',
    xaxis=dict(title='Features', tickangle=45),
    yaxis=dict(title='Importance'),
    template='plotly_white',
    height=600  # Adjust the figure height as needed
)

# Show the plot
fig.show()
In [93]:
counts = df.groupby(['gender', 'converted']).size().reset_index(name='count')

sns.kdeplot(x=df.gender, y=df.converted, cmap="Blues", fill=True, bw_adjust=0.5)

plt.yticks([0, 1])
plt.xticks([0, 1])

# Add annotations
for index, row in counts.iterrows():
    plt.text(row['gender'], row['converted'], row['count'], color='black', ha='center')
    
plt.show()
No description has been provided for this image
In [94]:
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
In [95]:
label_mapping
Out[95]:
{'female': 0, 'male': 1}
In [96]:
clustering_feats = ['scaled_customer_segment', 'scaled_gender', 'scaled_age', 'scaled_related_customers', 
                  'scaled_initial_fee_level', 'scaled_has_credit_account']
In [97]:
clustering_feats_indices = [True if feat in clustering_feats else False for feat in df.columns]

7. clustering algo¶

In [98]:
X_train, X_test = df.loc[:, clustering_feats], test.loc[:, clustering_feats]

choose optimal K

In [100]:
# Create the K means model for different values of K
def try_different_clusters(K, data):

    cluster_values = list(range(1, K+1))
    inertias=[]

    for c in cluster_values:
        model = KMeans(n_clusters = c,init='k-means++',max_iter=400,random_state=42)
        model.fit(data)
        inertias.append(model.inertia_)

    return inertias
In [101]:
# Find output for k values between 1 to 12 
outputs = try_different_clusters(12, X_train)
distances = pd.DataFrame({"clusters": list(range(1, 13)),"sum of squared distances": outputs})
In [102]:
# Finding optimal number of clusters k
figure = go.Figure()
figure.add_trace(go.Scatter(x=distances["clusters"], y=distances["sum of squared distances"]))

figure.update_layout(xaxis = dict(tick0 = 1,dtick = 1,tickmode = 'linear'),
                  xaxis_title="Number of clusters",
                  yaxis_title="Sum of squared distances",
                  title_text="Finding optimal number of clusters using elbow method")
figure.show()

Elbow rule - optimal K seems 5

In [104]:
# Initialize and fit KMeans on the train set
n_clusters = 5
clusterer = KMeans(n_clusters=n_clusters, random_state=42)
train_clusters = clusterer.fit_predict(X_train)
In [105]:
model = KMeans(n_clusters=n_clusters, random_state=42)
model.fit(df[['age', 'customer_segment']])
Out[105]:
KMeans(n_clusters=5, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=5, random_state=42)
In [106]:
# Predict clusters for test set
test_clusters = clusterer.predict(X_test)
In [107]:
train = df.copy()
In [108]:
# Add the predicted clusters to the test DataFrame
train.loc[:, 'predicted_cluster'] = train_clusters
test.loc[:, 'predicted_cluster'] = test_clusters
In [109]:
import matplotlib.pyplot as plt

# Calculate conversion rates
conversion_rates = train.groupby('predicted_cluster')['converted'].mean()

# Sort clusters by increasing conversion rates
conversion_rates_sorted = conversion_rates.sort_values()

# Plot the conversion rates
ax = conversion_rates_sorted.plot(kind='bar')
plt.title('Conversion Rate by Predicted Cluster')
plt.xlabel('Predicted Cluster')
plt.ylabel('Conversion Rate')

# Add number of points for each bar
for i, v in enumerate(conversion_rates_sorted):
    ax.text(i, v + 0.01, f'{train["predicted_cluster"].value_counts()[conversion_rates_sorted.index[i]]}', ha='center')

plt.show()
No description has been provided for this image
In [110]:
# Calculate conversion rate in each cluster for train set
train_conversion_rates = train.groupby('predicted_cluster')['converted'].mean()

# Sort clusters by increasing conversion rates for train set
train_conversion_rates_sorted = train_conversion_rates.sort_values()

# Calculate conversion rate in each cluster for test set
test_conversion_rates = test.groupby('predicted_cluster')['converted'].mean()

# Sort clusters by increasing conversion rates for test set
test_conversion_rates_sorted = test_conversion_rates.sort_values()

# Set the width of the bars
bar_width = 0.35

# Set the positions of the bars on the x-axis
r1 = np.arange(len(train_conversion_rates_sorted))
r2 = [x + bar_width for x in r1]

# Plot the conversion rates for both train and test sets
plt.bar(r1, train_conversion_rates_sorted, color='b', width=bar_width, label='Train')
plt.bar(r2, test_conversion_rates_sorted, color='r', width=bar_width, label='Test')

plt.title('Conversion Rate by Predicted Cluster')
plt.xlabel('Predicted Cluster')
plt.ylabel('Conversion Rate')
plt.xticks([r + bar_width/2 for r in range(len(train_conversion_rates_sorted))], train_conversion_rates_sorted.index, rotation=45)
plt.legend()
plt.show()
No description has been provided for this image
In [111]:
def evaluate_clustering(model, test_data):
    # Predict clusters on the test data
    predicted_labels = model.predict(test_data)
    
    # Silhouette Score
    silhouette = silhouette_score(test_data, predicted_labels)
    print("Silhouette Score:", silhouette)
In [112]:
evaluate_clustering(clusterer, X_test)
Silhouette Score: 0.39507667082064724

Conclusion¶

Key Steps¶

Feature Analysis¶

Based on variable distributions tried to learn features from technical and business point of view.

Top Features Selection¶

Utilizing some algorithms and methodologies, a list of top features for effective customer segmentation was created. These features serve as essential components in understanding customer preferences and behavior patterns.

Data Clustering¶

Kmeans clustering technique was used to uncover hidden patterns and structures within the dataset. This step was the last step for developing a customer segmentation.

Conclusion¶

In conclusion, the customer segmentation model integrates feature analysis, top feature selection, and data clustering to provide valuable insights into customer behavior. By leveraging the customer_segment feature for training, the model enhances the understanding of customer segmentation dynamics.

In [ ]: